# Multimodal Document Retrieval
Colnomic Embed Multimodal 7b
Apache-2.0
ColNomic Embed Multimodal 7B is a state-of-the-art multi-vector multimodal embedding model that excels at visual document retrieval, with multilingual support and unified text-image encoding.
Multimodal Fusion
Supports Multiple Languages
nomic-ai
7,909
45
Ret OpenCLIP ViT G 14
Apache-2.0
ReT is a retrieval method that supports multimodal queries and documents, achieving fine-grained retrieval by integrating multi-level representations from visual and textual backbone networks.
Multimodal Fusion
Transformers
aimagelab
77
0
Ret OpenCLIP ViT H 14
Apache-2.0
ReT is a retrieval method that supports multimodal queries and documents, achieving fine-grained retrieval by integrating multi-level representations from visual and textual backbone networks.
Multimodal Fusion
Transformers
aimagelab
23
0
Ret CLIP ViT L 14
Apache-2.0
ReT is a retrieval method that supports multimodal queries and documents, achieving fine-grained retrieval by fusing multi-level representations from visual and textual backbone networks.
Multimodal Fusion
Transformers
aimagelab
523
0
Colqwen2.5 3b Multilingual V1.0
MIT
A multilingual visual retrieval model based on Qwen2.5-VL-3B-Instruct and the ColBERT strategy, supporting dynamic input image resolution and multilingual document retrieval.
Text-to-Image
Supports Multiple Languages
tsystems
13.29k
8
Colqwen2.5 3b Multilingual V1.0 Merged
MIT
A multilingual visual retrieval model based on Qwen2.5-VL-3B-Instruct and the ColBERT strategy, supporting dynamic input image resolution and generating ColBERT-style multi-vector representations of text and images.
Text-to-Image
Transformers
Supports Multiple Languages
tsystems
70
0
Colqwen2.5 7b Multilingual V1.0
MIT
A multilingual visual retrieval model based on Qwen2.5-VL-7B-Instruct and the ColBERT strategy, ranked first on the ViDoRe benchmark.
Text-to-Image
Supports Multiple Languages
Metric-AI
4,699
7
Colqwen2.5 3b Multilingual V1.0
MIT
A multilingual visual retriever based on Qwen2.5-VL-3B-Instruct and the ColBERT strategy that performs strongly on the ViDoRe benchmark.
Text-to-Image
Supports Multiple Languages
Metric-AI
2,475
7
Colqwen2.5 V0.1
MIT
A visual retrieval model based on Qwen2.5-VL-3B-Instruct and the ColBERT strategy that generates multi-vector representations of text and images for efficient document retrieval.
Text-to-Image
Safetensors
English
vidore
985
0
Colqwen2 7b V1.0
A visual retrieval model based on Qwen2-VL-7B-Instruct that uses the ColBERT strategy to index documents efficiently from their visual features.
Text-to-Image
Supports Multiple Languages
tsystems
172
8
Colqwen2 7b V1.0
A visual retrieval model based on Qwen2-VL-7B-Instruct and the ColBERT strategy that produces multi-vector representations of text and images.
Text-to-Image
English
yydxlv
25
1
Colpali V1.3 Hf
ColPali is a vision-language model that extends PaliGemma-3B. It indexes documents efficiently from their visual features and generates ColBERT-style multi-vector representations.
Text-to-Image
Transformers
English
vidore
790
25
Visrag Ret
Apache-2.0
VisRAG is a retrieval-augmented generation (RAG) system built on vision-language models (VLMs). It embeds documents directly as images, avoiding the information loss caused by traditional text parsing.
Text-to-Image
English
openbmb
1,294
65
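
Several entries above describe "ColBERT-style multi-vector representations". As a rough illustration of what that means at query time, the sketch below scores pages with the late-interaction (MaxSim) rule: each query vector is matched to its most similar page vector, and those best matches are summed. It is a minimal sketch in plain Python, not the code of any listed model. The function names (`normalize`, `maxsim_score`, `rank`) and the two-dimensional toy vectors are hypothetical, and L2-normalizing the vectors before taking dot products is an assumption made here for simplicity.

```python
from math import sqrt


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for every query vector, take the highest
    dot product against any document vector, then sum those maxima.
    Assumes both inputs are non-empty lists of equal-length vectors."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


def rank(query_vecs, docs):
    """Return document names ordered from best to worst MaxSim score."""
    scores = {name: maxsim_score(query_vecs, vecs) for name, vecs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): one vector per query token or image patch.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "page_a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # covers both query vectors
    "page_b": [[1.0, 0.0], [1.0, 0.0]],              # covers only the first
}

print(rank(query, docs))
```

In this toy case `page_a` outranks `page_b` because it contains a close match for both query vectors, whereas `page_b` matches only one. Real models produce hundreds of higher-dimensional vectors per page, so practical systems usually run this computation in batches on an accelerator.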